Detecting Text Reuse with Modified and Weighted N-grams
Abstract
Text reuse is common: documents are often based, at least in part, on existing documents. This paper reports an approach to detecting text reuse that identifies not only documents reused verbatim but also cases where the original has been rewritten. The approach compares word n-grams across documents and modifies them (by substituting words with synonyms and by deleting words) to detect text that has been altered. Applied to a corpus of newspaper stories, the approach outperforms a previously reported method.
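The idea of matching modified word n-grams can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy synonym table, the choice of trigrams, and the rule that a one-word deletion is matched against the source's (n-1)-grams are all assumptions made for illustration, and no weighting scheme is modeled because the abstract does not describe one.

```python
def word_ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def substitution_variants(ngram, synonyms):
    """N-grams obtained by replacing one word with a listed synonym."""
    variants = set()
    for i, word in enumerate(ngram):
        for alternative in synonyms.get(word, ()):
            variants.add(ngram[:i] + (alternative,) + ngram[i + 1:])
    return variants


def deletion_variants(ngram):
    """(n-1)-grams obtained by deleting exactly one word."""
    return {ngram[:i] + ngram[i + 1:] for i in range(len(ngram))}


def reuse_score(source, derived, n=3, synonyms=None, allow_deletion=True):
    """Fraction of the derived text's n-grams found in the source,
    either exactly or after one modification (hypothetical scoring rule)."""
    synonyms = synonyms or {}
    source_n = word_ngrams(source, n)
    source_shorter = word_ngrams(source, n - 1) if n > 1 else set()
    derived_n = word_ngrams(derived, n)
    if not derived_n:
        return 0.0

    matched = 0
    for ngram in derived_n:
        if ngram in source_n:
            matched += 1  # verbatim reuse
        elif substitution_variants(ngram, synonyms) & source_n:
            matched += 1  # reuse after a synonym substitution
        elif allow_deletion and deletion_variants(ngram) & source_shorter:
            matched += 1  # reuse after dropping one word
    return matched / len(derived_n)


if __name__ == "__main__":
    source = "the prime minister announced a new policy on tuesday".split()
    derived = "the premier announced a fresh policy on tuesday".split()
    toy_synonyms = {"fresh": ["new"]}  # illustrative table, not a real thesaurus

    print(reuse_score(source, derived, allow_deletion=False))
    print(reuse_score(source, derived, synonyms=toy_synonyms))
```

In this toy pair, exact trigram matching finds little overlap because a single word change breaks every trigram that contains it; allowing substitutions and deletions recovers most of those matches. The deletion rule here is deliberately loose, so a real system would likely need to constrain or down-weight modified matches.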
Similar resources
N-gram Overlap in Automatic Detection of Document Derivation
Establishing authenticity and independence of documents in relation to others is not a new problem, but in the era of hyper production of e-text it certainly gained even more importance. There is an increased need for automatic methods for determining originality of documents in a digital environment. The method of n-gram overlap is only one of several methods proposed by the literature and is ...
Utterance Segmentation Using Combined Approach Based on Bi-directional N-gram and Maximum Entropy
This paper proposes a new approach to segmentation of utterances into sentences using a new linguistic model based upon Maximum-entropy-weighted Bidirectional N-grams. The usual N-gram algorithm searches for sentence boundaries in a text from left to right only. Thus a candidate sentence boundary in the text is evaluated mainly with respect to its left context, without fully considering its rig...
Detecting Co-Derivative Documents in Large Text Collections
We have analyzed the SPEX algorithm by Bernstein and Zobel [1] for detecting co-derivative documents using duplicate n-grams. Though we totally agree with the claim that not using unique n-grams can greatly increase efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. We propose a solution ...
Detection of New Malicious Code Using N-grams Signatures
Signature-based malicious code detection is the standard technique in all commercial anti-virus software. This method can detect a virus only after the virus has appeared and caused damage. Signature-based detection performs poorly when attempting to identify new viruses. Motivated by the standard signature-based technique for detecting viruses, and a recent successful text classification metho...
Hashing and Merging Heuristics for Text Reuse Detection
This paper describes a joint software entry by King Fahd University of Petroleum & Minerals and the University of Sheffield for the text-alignment task at PAN-2014. We employ the three steps of seeding, extension and filtering for text alignment. For seeding we use character n-grams with a variant of the Rabin-Karp algorithm for multiple pattern search. We then use an elaborate merging mechanism...